The purpose of this document is to demonstrate the types of analyses required for each of the “big data systems”. The examples here will be specified using the tidyverse, but any of the big data systems should cover similar content.

Required Packages

We first need to go over the packages that are required for the data system. For example, the Tidyverse is anchored by the “tidyverse” package (installing it installs the entire tidyverse system; loading it attaches the core packages). Let’s install the tidyverse and other packages required for this section.

cur_pkgs  <- rownames(installed.packages())
tidy_pkgs <- c("tidyverse",
               "here",
               "vctrs",
               "fs",
               "vroom",
               "magrittr")

# determine packages that are missing
miss_pkgs <- setdiff(x = tidy_pkgs,
                     y = cur_pkgs)

# installing missing packages
if(length(miss_pkgs)){
  install.packages(miss_pkgs,
                   repos = "https://cran.rstudio.com/")
}

Loading the tidyverse package provides two pieces of information:

  1. What “sub-packages” are part of the tidyverse system?
  2. What functions conflict with currently loaded functions?
# load the tidyverse packages
library(magrittr)
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✔ ggplot2 3.2.1     ✔ purrr   0.3.3
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ tidyr::extract()   masks magrittr::extract()
## ✖ dplyr::filter()    masks stats::filter()
## ✖ dplyr::lag()       masks stats::lag()
## ✖ purrr::set_names() masks magrittr::set_names()

We can see that the tidyverse contains the following packages:

  1. tibble: a reimagining of the data.frame. Tibbles are like data.frames but they have a different method of display, have some tidyverse benefits (like using rlang), and yell at you if you try to take a shortcut to doing anything.
  2. forcats: functions to work with factors.
  3. stringr: functions for string processing (essentially wrappers around the stringi package, an alternative to base R’s string-processing functions).
  4. readr: functions to read data into R and turn that data into tibbles.
  5. tidyr: functions for putting data into forms useful for summarization/aggregation.
  6. dplyr: functions designed to summarize/aggregate/manipulate data.
  7. purrr: functions for “functional programming” (map, reduce, filter). These are essentially (slow) alternatives to the sapply, vapply, lapply, and mapply functions.
  8. ggplot2: functions for graphing in an alternative system to base R graphics.

Note that the conflicts section indicates that once you load the tidyverse, filter from dplyr supersedes filter from stats. If you want to use the filter from the stats package, you then need to fully qualify the function (e.g., use stats::filter as the function name rather than filter).
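
As a quick sketch of how this masking works (toy numeric vector; the moving-average use of stats::filter is just for illustration):

```r
library(dplyr)

x <- c(1, 2, 3, 4, 5)

# after library(dplyr), bare `filter` means dplyr::filter (for data frames),
# so the time-series version must be fully qualified:
smoothed <- stats::filter(x, filter = rep(1/3, 3), sides = 2)

# the dplyr version still works on tibbles without qualification
kept <- tibble(x = x) %>% filter(x > 2)
```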

You can load each tidyverse package by itself rather than the whole group at once, but it helps to think of the Tidyverse as a framework, with the tidyverse package loading all of the dependencies required to work in that framework.

These are not the “only” packages in the tidyverse. There are tidyverse-dependency packages (rlang, vctrs, magrittr, crayon, cli, pillar, glue, tidyselect), tidyverse data-importing packages (vroom, readxl, haven, jsonlite, xml2, httr, rvest, DBI), and tidyverse data-manipulation packages (lubridate, hms, blob). There are also whole separate areas for package development (Tidy Development, which includes devtools and everything that devtools depends on) and modeling (tidymodels and everything that tidymodels depends on, including broom, infer, modelr, recipes, rsample, and yardstick). Note that the tidymodels package follows the same setup as the tidyverse package.

You can check whether any of the dependent tidyverse packages are out of date:

# check to see if any dependencies are out of date
tidyverse::tidyverse_update(recursive = TRUE,
                            repos     = "https://cran.rstudio.com/")
## The following packages are out of date:
## 
## ● knitr (1.26 -> 1.27)
## 
## Start a clean R session then run:
## install.packages("knitr")

And see which packages count as part of the tidyverse:

tidyverse::tidyverse_packages(include_self = FALSE)
##  [1] "broom"      "cli"        "crayon"     "dbplyr"     "dplyr"     
##  [6] "forcats"    "ggplot2"    "haven"      "hms"        "httr"      
## [11] "jsonlite"   "lubridate"  "magrittr"   "modelr"     "pillar"    
## [16] "purrr"      "readr"      "readxl"     "reprex"     "rlang"     
## [21] "rstudioapi" "rvest"      "stringr"    "tibble"     "tidyr"     
## [26] "xml2"

Loading Data

If your data is in “.csv” format, there are two packages for loading your data into R using the tidyverse: readr and vroom. If you are in an “R Project” in RStudio, the here package of the tidyverse returns the top-most directory of the project, and the fs package of the tidyverse can create file paths.

# pull project directory
project_dir <- here::here()

# combine file names to pull paths
data_files  <- vctrs::vec_c(
   demos  = "demos_to_merge",
   scores = "scores_to_merge",
   comb   = "data_to_rowbind"
)

# bind directory with file names to create file paths
data_paths  <- rlang::set_names(
   x  = fs::path(project_dir, "data", data_files, ext = "csv"),
   nm = names(data_files)
)

# can check to make sure the files exist before reading them in
print(fs::file_exists(data_paths))
##  /Users/nydicks/OneDrive - Korn Ferry/Documents/Projects/Workshops and Training/Workshops/2020/siop-2020-big-data-systems/data/demos_to_merge.csv 
##                                                                                                                                              TRUE 
## /Users/nydicks/OneDrive - Korn Ferry/Documents/Projects/Workshops and Training/Workshops/2020/siop-2020-big-data-systems/data/scores_to_merge.csv 
##                                                                                                                                              TRUE 
## /Users/nydicks/OneDrive - Korn Ferry/Documents/Projects/Workshops and Training/Workshops/2020/siop-2020-big-data-systems/data/data_to_rowbind.csv 
##                                                                                                                                              TRUE

You can use read_csv from the readr package or vroom from the vroom package to read the files into R.

# method one of reading files
tidy_data_1 <- map(.x = data_paths,
                   .f = read_csv)
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   guid = col_character(),
##   data_level = col_character(),
##   data_industry = col_character(),
##   data_function = col_character()
## )
## Warning: Missing column names filled in: 'X1' [1]
## Warning: Duplicated column names deduplicated: 'X1' => 'X1_1' [3]
## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   guid = col_character(),
##   X1_1 = col_double(),
##   X2 = col_double(),
##   X3 = col_double(),
##   X4 = col_double(),
##   X5 = col_double(),
##   X6 = col_double(),
##   X7 = col_double(),
##   X8 = col_double(),
##   X9 = col_double(),
##   X10 = col_double()
## )
## Warning: Missing column names filled in: 'X1' [1]
## Warning: Duplicated column names deduplicated: 'X1' => 'X1_1' [6]
## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   guid = col_character(),
##   data_level = col_character(),
##   data_industry = col_character(),
##   data_function = col_character(),
##   X1_1 = col_double(),
##   X2 = col_double(),
##   X3 = col_double(),
##   X4 = col_double(),
##   X5 = col_double(),
##   X6 = col_double(),
##   X7 = col_double(),
##   X8 = col_double(),
##   X9 = col_double(),
##   X10 = col_double()
## )
# this is equivalent to:
# tidy_data_1 <- list(
#   demos  = read_csv(data_files[1]),
#   scores = read_csv(data_files[2]),
#   comb   = read_csv(data_files[3])
# )

Note that packages in the tidyverse tend to be chatty. If you don’t specify something (and it has to make a guess), it will probably tell you what it guessed. Here you can specify the column types directly using the col_types argument. If you don’t specify them, read_csv will try to guess whether they’re double (col_double), character (col_character), logical (col_logical) or something else. Moreover, read_csv will never convert character strings to factors unless you specify this manually and will always convert the data into a tibble.
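
For example, you can pass a full specification via col_types (a minimal sketch using literal CSV text; readr treats a string containing newlines as inline data, though newer readr versions want it wrapped in I()):

```r
library(readr)

# hypothetical inline data standing in for a .csv file
csv_text <- "guid,data_level,score\naaa111,Supervisor,1.2\nbbb222,Functional Leader,-0.4\n"

dat <- read_csv(csv_text,
                col_types = cols(guid       = col_character(),
                                 data_level = col_character(),
                                 score      = col_double()))
# no "Parsed with column specification" chatter is printed,
# because nothing had to be guessed
```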

print(tidy_data_1[[1]])
## # A tibble: 545 x 5
##       X1 guid       data_level           data_industry     data_function        
##    <dbl> <chr>      <chr>                <chr>             <chr>                
##  1 23456 dXIOWn1q3p Individual Contribu… Communications    Human Resources      
##  2 23457 fI8ZSae2QY Supervisor           Communications    Sales                
##  3 23458 2laJn7v5KY Individual Contribu… Retail            Administrative Servi…
##  4 23459 SwnvIM48C6 Individual Contribu… Healthcare        Operations           
##  5 23460 3Z4PW7HeQj Individual Contribu… Consumer Goods    Administrative Servi…
##  6 23461 7mTQBMsyzZ Individual Contribu… Consumer Goods    Administrative Servi…
##  7 23462 luxwZnq6oP Individual Contribu… Healthcare        Marketing            
##  8 23463 sjN92UPinR Functional Leader    Financial Servic… Sales                
##  9 23464 JsE8oj62vb Individual Contribu… Financial Servic… General Management   
## 10 23465 sxKzkhN4eR Individual Contribu… Consumer Goods    Sales                
## # … with 535 more rows

Notice also that read_csv converted the row names to an X1 column in the data and renamed any existing X1 column to something else (namely X1_1). There is no easy way to change this behavior (short of specifying all of the columns to read in), but we can certainly fix it!

# a function to remove the first column and rename any X1_1 column as X1
remove_rowname_column <- function(tbl){
   tbl <- tbl %>%
          select(-1)
   
   # there are better ways of doing this that are more general
   if("X1_1" %in% names(tbl)){
      tbl <- tbl %>%
             rename(X1 = X1_1)
   }
   
   tbl
}

# removing the rownames column
tidy_data_1 <- map(.x = tidy_data_1,
                   .f = remove_rowname_column)
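
A slightly more general version edits names() directly rather than testing for the column (still a sketch, and it assumes the only deduplication clash is with an original X1 column):

```r
remove_rowname_column_2 <- function(tbl) {
  tbl <- dplyr::select(tbl, -1)                  # drop the rownames column
  # undo read_csv's "X1 => X1_1" deduplication, if it happened
  names(tbl) <- sub("^X1_1$", "X1", names(tbl))
  tbl
}
```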

Note that we can also read in the data using the vroom function. This function is similar to read_csv but faster (it indexes the file and reads values lazily). Notice that vroom::vroom has an argument, col_select, that allows us to drop the first column of the data (the column holding the rownames) without applying the earlier function.

# method two of reading files
tidy_data_2 <- map(.x = data_paths,
                   .f = vroom::vroom,
                   col_select = -1)
## Rows: 545
## Columns: 4
## Delimiter: ","
## chr [4]: guid, data_level, data_industry, data_function
## 
## Use `spec()` to retrieve the guessed column specification
## Pass a specification to the `col_types` argument to quiet this message
## Rows: 545
## Columns: 11
## Delimiter: ","
## chr [ 1]: guid
## dbl [10]: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10
## 
## Use `spec()` to retrieve the guessed column specification
## Pass a specification to the `col_types` argument to quiet this message
## Rows: 250
## Columns: 14
## Delimiter: ","
## chr [ 4]: guid, data_level, data_industry, data_function
## dbl [10]: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10
## 
## Use `spec()` to retrieve the guessed column specification
## Pass a specification to the `col_types` argument to quiet this message
# this is equivalent to:
# tidy_data_2 <- list(
#   demos  = vroom::vroom(data_files[1], col_select = -1),
#   scores = vroom::vroom(data_files[2], col_select = -1),
#   comb   = vroom::vroom(data_files[3], col_select = -1)
# )

Note that the data looks the same as it did before.

print(tidy_data_2[[1]])
## # A tibble: 545 x 4
##    guid       data_level             data_industry      data_function          
##    <chr>      <chr>                  <chr>              <chr>                  
##  1 dXIOWn1q3p Individual Contributor Communications     Human Resources        
##  2 fI8ZSae2QY Supervisor             Communications     Sales                  
##  3 2laJn7v5KY Individual Contributor Retail             Administrative Services
##  4 SwnvIM48C6 Individual Contributor Healthcare         Operations             
##  5 3Z4PW7HeQj Individual Contributor Consumer Goods     Administrative Services
##  6 7mTQBMsyzZ Individual Contributor Consumer Goods     Administrative Services
##  7 luxwZnq6oP Individual Contributor Healthcare         Marketing              
##  8 sjN92UPinR Functional Leader      Financial Services Sales                  
##  9 JsE8oj62vb Individual Contributor Financial Services General Management     
## 10 sxKzkhN4eR Individual Contributor Consumer Goods     Sales                  
## # … with 535 more rows

Let’s save these datasets as separate objects in R to be used for further analysis.

tidy_demos  <- pluck(tidy_data_1, "demos")
tidy_scores <- pluck(tidy_data_1, "scores")
tidy_comb_1 <- pluck(tidy_data_1, "comb")

Note that we have three sets of data.

  1. tidy_demos are demographic data for some of the unique IDs in tidy_scores.
print(tidy_demos)
## # A tibble: 545 x 4
##    guid       data_level             data_industry      data_function          
##    <chr>      <chr>                  <chr>              <chr>                  
##  1 dXIOWn1q3p Individual Contributor Communications     Human Resources        
##  2 fI8ZSae2QY Supervisor             Communications     Sales                  
##  3 2laJn7v5KY Individual Contributor Retail             Administrative Services
##  4 SwnvIM48C6 Individual Contributor Healthcare         Operations             
##  5 3Z4PW7HeQj Individual Contributor Consumer Goods     Administrative Services
##  6 7mTQBMsyzZ Individual Contributor Consumer Goods     Administrative Services
##  7 luxwZnq6oP Individual Contributor Healthcare         Marketing              
##  8 sjN92UPinR Functional Leader      Financial Services Sales                  
##  9 JsE8oj62vb Individual Contributor Financial Services General Management     
## 10 sxKzkhN4eR Individual Contributor Consumer Goods     Sales                  
## # … with 535 more rows
  2. tidy_scores are scores on each of the variables for IDs in tidy_demos.
print(tidy_scores)
## # A tibble: 545 x 11
##    guid        X1     X2       X3      X4     X5     X6      X7      X8      X9
##    <chr>    <dbl>  <dbl>    <dbl>   <dbl>  <dbl>  <dbl>   <dbl>   <dbl>   <dbl>
##  1 dXIO…  0.715    0.290  0.573   -0.794  -1.36  -0.573  0.228  -0.0798  0.740 
##  2 fI8Z…  1.17     0.131 -0.910   -0.475  -0.836 -0.317 -0.0124  0.529   1.74  
##  3 2laJ… -1.46     0.169  0.677   -0.818   2.39  -0.609 -0.461  -0.0579  0.506 
##  4 Swnv…  1.42     0.211  1.53     2.29    0.126  0.681  0.486   0.587   0.116 
##  5 3Z4P…  0.454   -1.31   0.946   -0.357   1.22   0.293  1.25    2.18    1.78  
##  6 7mTQ…  0.924    0.816  1.33     0.905  -0.734 -0.419  0.364  -0.292   1.07  
##  7 luxw…  1.02     0.717  0.201    1.40    0.256  0.818  1.66    1.09   -1.02  
##  8 sjN9… -0.734   -0.244 -0.760   -0.865  -0.960  0.266 -0.201  -0.102  -0.0861
##  9 JsE8… -1.28    -0.743  0.00103 -1.13   -0.545 -0.260 -1.41   -2.45   -0.863 
## 10 sxKz… -0.00982  0.793 -0.472   -0.0497 -0.113 -0.194  1.84   -0.462   0.181 
## # … with 535 more rows, and 1 more variable: X10 <dbl>
  3. tidy_comb_1 is a combination of data (demos and scores combined) for IDs not in tidy_demos or tidy_scores.
print(tidy_comb_1)
## # A tibble: 250 x 14
##    guid  data_level data_industry data_function     X1      X2       X3      X4
##    <chr> <chr>      <chr>         <chr>          <dbl>   <dbl>    <dbl>   <dbl>
##  1 R5z6… Individua… Financial Se… General Mana…  1.91  -1.08    0.626   -0.255 
##  2 V14B… Individua… Communicatio… Human Resour…  0.611  1.20   -0.0607   0.974 
##  3 hT6t… Supervisor Communicatio… General Mana…  1.69  -0.0915  1.90     0.106 
##  4 9geR… Functiona… Communicatio… Operations     0.666  0.0865  0.781    0.0952
##  5 JeN4… Individua… Retail        Operations    -1.23  -0.720  -0.461    0.396 
##  6 WqQg… Supervisor Communicatio… Operations    -0.586 -0.337   0.498   -0.757 
##  7 rKo3… Supervisor Consumer Goo… Operations     0.503  0.552   0.599    0.983 
##  8 OQ6w… Mid-level… Healthcare    Administrati… -0.575  0.109  -0.00838 -0.605 
##  9 6qBT… Functiona… Communicatio… Marketing      0.793 -1.26   -0.650    0.208 
## 10 qirL… Supervisor Retail        Marketing      2.51  -0.988  -0.209    1.09  
## # … with 240 more rows, and 6 more variables: X5 <dbl>, X6 <dbl>, X7 <dbl>,
## #   X8 <dbl>, X9 <dbl>, X10 <dbl>

Merging Data

The Tidyverse package dplyr contains several functions to merge two tibbles. Unlike base-R, which has one function to do all of the joining, the Tidyverse tries to make individual functions for specific goals (with each function having a different, descriptive name).

  1. inner_join: keep rows of x and rows of y that have the same elements of the by variable. Named after the SQL join “INNER JOIN” or “JOIN”.
  2. left_join: keep all rows of x and ONLY rows of y that have by elements in x. Named after the SQL join “LEFT JOIN” or “LEFT OUTER JOIN”.
  3. right_join: keep all rows of y and ONLY rows of x that have by elements in y. Named after the SQL join “RIGHT JOIN” or “RIGHT OUTER JOIN”.
  4. full_join: keep all rows of x and all rows of y. Named after the SQL join “FULL JOIN” or “FULL OUTER JOIN”.
  5. semi_join: keep rows of x that have by elements in y. Named after the SQL join “LEFT SEMI JOIN”.
  6. anti_join: keep rows of x that do not have by elements in y. Named after the SQL join “LEFT ANTI JOIN”.
  7. nest_join: keep all rows of x and put rows of y that have a particular by element as an element of a “tibble” column.
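
Before applying these to our data, here is a minimal sketch of the row/column behavior with two hypothetical toy tibbles:

```r
library(dplyr)

df_x <- tibble(id = c("a", "b", "c"), level = c(1, 2, 3))
df_y <- tibble(id = c("b", "c", "d"), score = c(10, 20, 30))

inner_join(df_x, df_y, by = "id")  # rows b, c; columns id, level, score
left_join(df_x, df_y, by = "id")   # rows a, b, c; score is NA for a
full_join(df_x, df_y, by = "id")   # rows a, b, c, d
semi_join(df_x, df_y, by = "id")   # rows b, c; ONLY df_x's columns
anti_join(df_x, df_y, by = "id")   # row a; ONLY df_x's columns
```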

To see how these work, let’s remove some rows from tidy_demos and tidy_scores. Let’s sort the rows ahead of time so that the joins return data in the same order. Note that, unlike base R’s merge, joins in the Tidyverse do not automatically sort the rows.

# sort dfs by guid
tidy_demos    <- tidy_demos %>%
                 arrange(guid)
tidy_scores   <- tidy_scores %>%
                 arrange(guid)

# remove a few odd rows from tidy_demos and even rows from tidy_scores
tidy_demos_1  <- tidy_demos %>%
                 mutate(row_n = row_number()) %>%
                 slice(-vctrs::vec_c(1, 3, 5)) %>%
                 select(row_n, everything())

tidy_scores_1 <- tidy_scores %>%
                 mutate(row_n = row_number()) %>%
                 slice(-vctrs::vec_c(2, 4)) %>%
                 select(row_n, everything())

Notice that the row_n variable is in order and can tell us what is going on. Also note that we removed rows 1, 3, and 5 from the demographics data and rows 2 and 4 from the scores data.

tidy_demos_1
tidy_scores_1

What happens when we apply each of the standard join types?

id_var <- vctrs::vec_c("row_n", "guid")
inner_join(x  = tidy_demos_1,
           y  = tidy_scores_1,
           by = id_var) %>%
   arrange(row_n)
left_join(x  = tidy_demos_1,
          y  = tidy_scores_1,
          by = id_var) %>%
   arrange(row_n)
right_join(x  = tidy_demos_1,
           y  = tidy_scores_1,
           by = id_var) %>%
   arrange(row_n)
full_join(x  = tidy_demos_1,
          y  = tidy_scores_1,
          by = id_var) %>%
   arrange(row_n)

Note that inner_join kept only the rows that appeared in both tables (6, …), left_join kept the rows in the first table regardless of the second (2, 4, 6, …), right_join kept the rows in the second table regardless of the first (1, 3, 5, 6, …), and full_join kept all rows. In all cases, all of the variables were kept, and values coming from missing rows were set to NA.

Filtering joins (semi_join and anti_join) are a quick way of keeping or dropping rows using “join” terminology rather than filtering the data directly.

# keep all tidy_demos_1 rows with equivalents in tidy_scores_1
out <- semi_join(x  = tidy_demos_1,
                 y  = tidy_scores_1,
                 by = id_var)
out
names(out)
## [1] "row_n"         "guid"          "data_level"    "data_industry"
## [5] "data_function"
# keep all tidy_demos_1 rows without equivalents in tidy_scores_1
out <- anti_join(x  = tidy_demos_1,
                 y  = tidy_scores_1,
                 by = id_var)
out
names(out)
## [1] "row_n"         "guid"          "data_level"    "data_industry"
## [5] "data_function"

Note that semi_join returned the same rows as inner_join but kept ONLY the columns of x (none of the variables of y were added). Nested joins are more complicated and beyond the scope of this demonstration.

Using the original data, any of the combining joins will return the same result, because both tables contain exactly the same IDs. When that is the case, inner_join should be the fastest.

id_var      <- "guid"
tidy_comb_2 <- inner_join(x  = tidy_demos,
                          y  = tidy_scores,
                          by = id_var)

Finally, we have additional data (tidy_comb_1) that we want to combine with the merged data. We can bind everything together using bind_rows. The simplest way to use the function is in the same way that you use rbind.

tidy_comb   <- bind_rows(tidy_comb_1,
                         tidy_comb_2)
tidy_comb

If your data is in a list format, the bind_rows function will also work.

tidy_comb <- list(tidy_comb_1,
                  tidy_comb_2) %>%
             bind_rows()
tidy_comb

If you add an .id argument to bind_rows, bind_rows will add an extra column indicating where each row of the data came from.

bind_rows(tidy_comb_1,
          tidy_comb_2,
          .id = "source")
bind_rows(blah = tidy_comb_1,
          blee = tidy_comb_2,
          .id  = "source")

Unlike rbind, bind_rows will work even if your data does not have the same columns, simply filling the missing places with NA.

df_1 <- select(tidy_comb_1,
               1, 3, 5) %>%
        slice(1, 2)
df_1
df_2 <- select(tidy_comb_2,
               1, 2, 4, 6, 7) %>%
        slice(1, 2)
df_2

What happens when we combine the data together?

bind_rows(df_1,
          df_2)

However, dplyr will not convert column types, so if a column is a character string in one dataset and numeric in another, bind_rows will error.

# change the type of data
df_1 <- tidy_comb_1 %>%
        mutate(X1 = vctrs::vec_cast(x  = X1,
                                    to = "character"))
df_2 <- tidy_comb_2

# try to bind rows together
safely(.f = bind_rows)(df_1, df_2)$error
## <Rcpp::exception: Column `X1` can't be converted from character to numeric>
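
The usual fix is to cast the offending column to a common type before binding (a toy sketch with hypothetical one-row tibbles):

```r
library(dplyr)

df_a <- tibble(guid = "x", X1 = "0.5")  # X1 stored as character
df_b <- tibble(guid = "y", X1 = 1.2)    # X1 stored as numeric

bound <- df_a %>%
         mutate(X1 = as.numeric(X1)) %>%  # cast to a common type first
         bind_rows(df_b)
```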

Reshaping Data

Currently the data is in wide format, with each item (X1-X10) a column of the data. Hadley Wickham defined tidy data (see https://vita.had.co.nz/papers/tidy-data.pdf) as “Each variable forms a column; each observation forms a row; each type of observational unit forms a table” (Wickham, p. 4). Note that in the table, guid, data_level, data_industry, and data_function are clearly separate variables (and so should form separate columns). One could argue that each of X1-X10 is also a separate variable (they are separate items or separate constructs). However, they might also be a single item (normed_score) that aligns with a separate variable (trait). Data arranged that way is said to be in “long format”.

There are several functions for translating data from “wide” format to “long” format in tidyr. The legacy methods are gather (turn wide-data into long-data) and spread (turn long-data into wide-data):

var_name       <- "trait"        # the category variable made from the columns
val_name       <- "normed_score" # the numeric variable made from the values
tidy_comb_long <- gather(tidy_comb,
                         key   = !!var_name,
                         value = !!val_name,
                         matches("^X[0-9]+$"))
tidy_comb_wide <- spread(tidy_comb_long,
                         key    = !!var_name,
                         value  = !!val_name)

# check if we've reversed everything
all_equal(target  = tidy_comb_wide,
          current = tidy_comb)
## [1] TRUE

Note that the rows and columns aren’t necessarily in the same order, but we have reversed the operation.

tidy_comb_long
tidy_comb_wide

In this case, gather takes data in wide format and makes it in long format and spread takes data in long format and makes it in wide format. For both functions, the key argument is the variable in long format that matches the names of the variables in wide format, and value is the variable in long format that matches the variable values in wide format. An easy way to remember this is “key is the top of your data, and value indicates the entries”. For gather, you also need to indicate which variables you want to turn into the “key” and “value” columns. Anything that you can do in select, you can do here:

  1. Select raw variable names
gather(tidy_comb,
       key   = !!var_name,
       value = !!val_name,
       X1, X2, X3, X4, X5, X6, X7, X8, X9, X10)
  2. Remove raw variable names
gather(tidy_comb,
       key   = !!var_name,
       value = !!val_name,
       -guid, -data_level, -data_industry, -data_function)
  3. Use dplyr::select helper functions: matches for pattern matching; starts_with, ends_with, and contains for finding specific strings; num_range for finding the actual range of values; one_of for character vectors; etc.

Note also that for these functions key and value take raw variable names (either unquoted, such as trait or quoted such as “trait”). Because these functions take raw variable names, if you want to use a variable name saved in an object (such as var_name <- "trait"), you need to use “tidy unquoting methods”. Essentially, this means putting two exclamation points in front of the variable. In other words, in the tidy framework, !!x means “use whatever is contained in the variable x and not the name x itself.”
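
To see why the !! matters, compare what happens without it (a toy sketch; wide is a hypothetical two-item-wide tibble):

```r
library(tidyverse)

wide     <- tibble(id = c("a", "b"), X1 = c(1, 2), X2 = c(3, 4))
var_name <- "trait"

# without !!, gather quotes the expression itself, so the new
# column is literally named "var_name"
names(gather(wide, key = var_name, value = "score", X1, X2))

# with !!, the *contents* of var_name ("trait") are used instead
names(gather(wide, key = !!var_name, value = "score", X1, X2))
```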

In current versions of tidyr, spread and gather are deprecated. The newer versions of those functions are called pivot_wider (matching spread) and pivot_longer (matching gather). We can accomplish the same results with these functions, but keep in mind that the application is a bit different.

tidy_comb_long_2 <- pivot_longer(tidy_comb,
                                 cols      = matches("^X[0-9]+$"),
                                 names_to  = var_name,
                                 values_to = val_name)
tidy_comb_wide_2 <- pivot_wider(tidy_comb_long_2,
                                names_from  = var_name,
                                values_from = val_name)

# check if we've reversed everything
all_equal(target  = tidy_comb_wide_2,
          current = tidy_comb_wide)
## [1] TRUE
all_equal(target  = tidy_comb_long_2,
          current = tidy_comb_long)
## [1] TRUE

Notice that these functions got rid of the weird quoting rules: names_to and values_to ONLY take quoted variable names (character strings), which prevents needing the odd “tidy escape syntax” in these functions. These functions can also do much more complicated things than gather and spread. For instance, pivot_longer can

  1. Use a regular expression to remove text from the start of each variable name
pivot_longer(tidy_comb,
             cols      = matches("^X[0-9]+$"),
             names_to  = var_name,
             values_to = val_name,
             
             # remove the "X" in front of the variable names
             names_prefix = "^X")
  2. Break up the names of the columns into separate “variables”
pivot_longer(tidy_comb,
             cols      = matches("^X[0-9]+$"),
             values_to = val_name,
             
             # put the "X" in front of the variable names into a separate variable
             names_to  = c("prefix", var_name),
             names_sep = 1)
  3. Indicate prototypes for the type of the names/values in the final data.
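
That third option uses the names_ptypes/values_ptypes arguments, which assert what type the new columns should have; the actual conversion of, say, the leftover "1"/"2" labels to integers is done with names_transform, which requires tidyr >= 1.1 (a sketch with a hypothetical toy tibble):

```r
library(tidyverse)

wide <- tibble(id = c("a", "b"), X1 = c(1, 2), X2 = c(3, 4))

long <- pivot_longer(wide,
                     cols         = matches("^X[0-9]+$"),
                     names_to     = "trait_number",
                     values_to    = "normed_score",
                     names_prefix = "^X",
                     # turn the leftover "1"/"2" labels into integers
                     names_transform = list(trait_number = as.integer))
```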

If you really want to be creative with how to transform your data, you can use build_wider_spec and pivot_wider_spec or build_longer_spec and pivot_longer_spec. This allows you to build a specification for how to transform your dataset.

tidy_comb_spec <- build_longer_spec(tidy_comb,
                                    cols      = matches("^X[0-9]+$"),
                                    names_to  = var_name,
                                    values_to = val_name)
tidy_comb_spec

In this case, we have several columns:

  1. .name: the name of the variable that we are going to do something with.
  2. .value: for a given name, what variable does the corresponding value go into?
  3. Additional columns: for a given name, what value goes into each additional column?

Because this is a tibble, we can mutate, change, update, add things, and the final pivot_longer_spec will use the specification to determine what to do with the final columns.

tidy_comb_spec <- tidy_comb_spec %>%
                  separate(col  = "trait",
                           into = vctrs::vec_c("prefix", "trait_number"),
                           sep  = 1) %>%
                  mutate(.value          = inset(.value,
                                                 2,
                                                 value = "whoops"),
                         random_variable = letters[vctrs::vec_seq_along(prefix)])
tidy_comb_spec

Now we can apply this specification and X1 will have its “value” go into the variable “normed_score” with the values of “prefix” to be “X”, “trait_number” to be 1, and “random_variable” to be “a”. However, X2 will have its value go into the variable “whoops” with the values of “prefix” to be “X”, “trait_number” to be 2, and “random_variable” to be “b”.

pivot_longer_spec(data = tidy_comb,
                  spec = tidy_comb_spec) %>%
   select(-(guid:data_function))

So a specification is effectively “for the variable with this name, where does its value go, what other variables need to be created, and what are the values of those variables?” Note that pivot_longer_spec and pivot_wider_spec use the same specification.

all_equal(target  = tidy_comb,
          current = tidy_comb %>%
                    pivot_longer_spec(tidy_comb_spec) %>%
                    pivot_wider_spec(tidy_comb_spec,
                                     values_fn = list(whoops = first)))
## [1] TRUE

Technically, the values_fn shouldn’t be needed there, but the function is giving me “list” columns for the “whoops” variable (which seems to be a bug).

Aggregating Data